The Inverse of Exact Renormalization Group Flows as Statistical Inference

We build on the view of the Exact Renormalization Group (ERG) as an instantiation of Optimal Transport described by a functional convection–diffusion equation. We provide a new information-theoretic perspective for understanding the ERG through the intermediary of Bayesian Statistical Inference. This connection is facilitated by the Dynamical Bayesian Inference scheme, which encodes Bayesian inference in the form of a one-parameter family of probability distributions solving an integro-differential equation derived from Bayes’ law. In this note, we demonstrate how the Dynamical Bayesian Inference equation is, itself, equivalent to a diffusion equation, which we dub Bayesian Diffusion. By identifying the features that define Bayesian Diffusion and mapping them onto the features that define the ERG, we obtain a dictionary outlining how renormalization can be understood as the inverse of statistical inference.


Introduction
The Renormalization Group (RG) is used in physical settings to deduce how a theory changes when it is viewed at different scales. In Wilsonian RG, one regards the set of possible theories as a space coordinatized by a collection of coupling constants specifying all of the possible contributions to a classical action. An RG flow can then be understood as a one-parameter family, or flow, in the space of theories generated by a vector field on theory space referred to as the Beta Function [1]. The flow is completely specified once the requisite initial data are provided, for example, by fixing a theory in the ultraviolet (UV) from which the RG flow begins (by fixing a theory in the UV, we mean a theory that is a priori valid at all energy scales).
A typical approach to RG is to deduce the beta function by sequentially coarse-graining the theory, that is, by integrating out degrees of freedom that become suppressed at lower energy scales. In field theoretic contexts, this approach makes use of perturbative techniques in order to perform the requisite functional integrals. From a formal perspective, however, it is possible to study the properties of renormalization as an abstract flow equation without immediately concerning ourselves with any complicated or even intractable calculations that may be required to realize the flow explicitly. This approach to RG goes under the name of the Exact Renormalization Group (ERG) and will be the primary focus of our paper.
In our approach to ERG, we shall regard a Quantum Field Theory (QFT) as equivalent to the specification of a probability distribution on the space of fields included in the theory, which we denote by F(M) and refer to as the sample space in standard probability theoretic nomenclature (to accommodate this perspective, we shall consider only Euclidean QFT).
From this perspective, the space of theories is isomorphic to the space of probability distributions on the sample space F(M), denoted by M. An ERG flow is therefore equivalent to a one-parameter family of probability distributions in M.
From this point of view, we can still make contact with the Wilsonian picture by regarding a Wilsonian Effective Action, written in terms of a collection of coupling constants, as specifying a parametric family of probability distributions. In other words, Wilsonian RG corresponds to a particular coordinatization of the space M.
Following the lead of [2][3][4] and others, it is natural to interpret the one-parameter family of probability distributions generated by an ERG flow as being governed by a functional convection-diffusion equation on the sample space F(M). Regarding ERG as a flow on the space of probability distributions over a given sample space contextualizes renormalization in a language that is amenable to applications outside of the usual realm of physical theory. In particular, it suggests a manifestly information-theoretic interpretation for renormalization. Uncovering and presenting the details of this interpretation is the primary objective of this note.
In seeking an answer to the question, "What is the information-theoretic interpretation of an ERG flow?" we find it useful to consider a related question: "What does it mean to invert an ERG flow?" Here, our understanding of ERG as being governed by a diffusion equation provides us with a direction. A powerful method for formally inverting a diffusion process is Bayesian Inversion [5]. Bayesian Inversion is a probabilistic approach used to determine the initial data that were fed into a partial differential equation and subsequently generated an observed sequence of outcomes. As the name implies, the main tool employed in Bayesian Inversion is Bayes' Law. This lends some credence to the idea that Bayesian Inference may serve as an "inverse process" to ERG.
In our previous publication [6], we introduced Dynamical Bayesian Inference, which recasts Bayesian Inference as a dynamical system. Like ERG, Dynamical Bayesian Inference generates a one-parameter family of probability distributions satisfying a flow equation. Whereas in ERG the flow equation is deduced by the sequential application of a coarse-graining law, in Dynamical Bayes the flow equation is obtained by the sequential application of Bayes' law with respect to a continuously growing set of observed data. In this respect, an ERG flow continuously loses information (hence, it is diffusive), while a Dynamical Bayesian flow continuously gains information.
In this note, we shall argue in favor of this interpretation. In particular, we will demonstrate that the Dynamical Bayesian Flow equation gives rise to a convection-diffusion equation describing the evolution of an associated posterior predictive distribution backwards against the collection of additional data. We refer to the process described by this equation as Bayesian Diffusion, or Backward Inference. We interpret the equation governing Bayesian Diffusion as defining an ERG flow, with a coarse-graining procedure given by the continuous discarding of observed data. By construction, this ERG is inverted by the forward Dynamical Bayesian inference flow that is obtained by reincorporating the lost data back into the model. Alternatively, starting from an ERG flow, we identify the Dynamical Bayesian flow for which it is the Backward Inference process by drawing a correspondence between the partial differential equation governing ERG and the partial differential equation governing Bayesian Diffusion. Ultimately, this correspondence suggests a fascinating information-theoretic interpretation for ERG: it can be regarded as the one-parameter family of probability distributions obtained by starting from a data-generating model and continuously throwing away information in the form of observed data.
The paper is organized as follows: in Section 2, we review the formulation of ERG as a functional differential equation, specifically in the Wegner-Morris form. We also demonstrate that the Wegner-Morris equation is equivalent to a Fokker-Planck equation with a given potential function, establishing the correspondence between ERG and optimal transport. In Section 3, we introduce Stochastic Differential Equations (SDEs) and discuss the relationship between SDEs and partial differential equations of the type appearing in ERG. In Section 4, we review Dynamical Bayesian Inference and derive the Bayesian Diffusion equations. Finally, in Section 5, we establish the correspondence between Bayesian Diffusion and ERG explicitly. We conclude with Section 6, in which we review our findings and discuss future research directions.

ERG Equation as a Functional Diffusion Equation
In this section, we will provide an overview of functional renormalization with the objective of demonstrating how ERG can be understood as a functional differential equation. Our presentation will only focus on the aspects of ERG that are relevant for this work; for a more complete overview, see [2,4,7,8] and the references therein. The presentation here closely follows that in [4].
Let us consider a theory with fields $\Phi \in F(M)$. Here, we are using a notation in which capital letters correspond to random variables, and lower case letters to realizations. In this case, $\Phi$ is to be regarded as a random variable taking values in a space of continuous functions on a manifold $M$. The distribution over $\Phi$ is a probability functional, $P[\phi(x)] = \frac{1}{Z} e^{-S[\phi(x)]}$, where $S[\phi(x)]$ is the Euclidean action. The idea of RG is that we are only capable of probing scales less than a cutoff $\Lambda$. As we change our cutoff, we obtain a family of probability distributions, $P_\Lambda[\phi(x)] \propto e^{-S_\Lambda[\phi(x)]}$, corresponding to effective descriptions for the field $\Phi$ at each measurement threshold. ERG provides a description for the changes to the effective probability distribution in the form of a flow equation: where $F$ is a functional of $P_\Lambda[\phi]$ and all its functional derivatives and specifies a particular ERG scheme.

Polchinski's Equation
The canonical example of ERG is Polchinski's approach [2]. We begin with the partition function of a scalar field with source $J$ given by: Here, $S_{\mathrm{int},\Lambda}$ is the interacting action, and $K_\Lambda(p^2)$ is a cutoff function that differentially weights modes $\phi(p)$ based on their momenta. Polchinski's idea was to consider a scale $\Lambda_R < \Lambda$ and integrate out modes down to $\Lambda_R$. To this end, we assume that $J(p)$ has compact support in the sphere of momenta with radius $\Lambda_R - \epsilon$ for a small $\epsilon > 0$. If $\Lambda_R$ is only infinitesimally smaller than $\Lambda$, we can compute the differential change to $Z_\Lambda[J]$ on account of integrating out the shell of modes between $\Lambda$ and $\Lambda_R$. Then, we demand that: where $A_\Lambda$ is a constant for each value of $\Lambda$. If (3) holds, any correlation functions below the changed scale (i.e., for which one can take functional derivatives with respect to $J$) will be unchanged. Hence, this form of RG respects the fact that measured correlation functions ought to be independent of the RG flow below the relevant momentum scale.
Expanding out (3), we find: Again, $K_\Lambda(p^2)$ is a function with prescribed $\Lambda$ dependence. The behavior governed by (3) is that of $S_{\mathrm{int},\Lambda}$. Polchinski showed that one can consistently satisfy (3) by taking $S_{\mathrm{int},\Lambda}$ to satisfy the functional differential equation: Following the approach of (1), we define the probability functional: $P_\Lambda[\phi] = e^{-S_\Lambda[\phi]}/Z_\Lambda$, which is explicitly the probability distribution for the field $\Phi$ at scale $\Lambda$. Then, we can rewrite the Polchinski equation in the form of a convection-diffusion equation as: Throughout this note, we shall emphasize the sense in which such equations can be understood as functional analogs of better-defined finite dimensional equations. For example, (6) ought to be compared to the finite dimensional equation: where we have written sums explicitly to make the analogy clear. To move between the finite dimensional equation and the Polchinski equation, we have made use of the dictionary found in Table 1.

Wegner-Morris Flows
We have now established that Polchinski's equation is simply an infinite dimensional convection-diffusion equation. It can be shown that a family of such equations exists that satisfies the constraint (3). This family is defined by considering different choices for the metric and drift velocity appearing in Table 1. The choice of this data then corresponds to a choice of scheme for the ERG flow. The aforementioned family of ERG schemes is given explicit representation in terms of the Wegner-Morris flow equation [9][10][11][12][13]. As a one-parameter family of probability distributions, the Wegner-Morris flow is governed by the equation:

Table 1. Dictionary relating Finite Dimensional Diffusion and Polchinski's ERG equation.

The Wegner-Morris scheme is encapsulated in the kernel $\Psi_\Lambda[\phi, x]$. Again, it is useful to compare to a finite dimensional flow equation: Here, $\Psi_\Lambda$ has been replaced by a vector field that depends not only on $y$ but on the entire probability function $p_t$. It is often natural to regard $V$ as the gradient of a scalar function $W(p_t, y)$, which also depends on $p_t$, i.e., $V^i = g^{ij}(y) \frac{\partial}{\partial y^j} W(p_t, y)$. Then, we can write: A natural interpretation of (8) is that it simply reparameterizes the field at each new scale. To be precise, as $\Lambda$ changes we find: This implies that the Wegner-Morris flow conserves probability; it only changes the way that the probability is distributed. This is the reason why the Wegner-Morris family of flow equations satisfies (3). On account of this new interpretation, $\Psi_\Lambda$ is given the title of the Reparameterization Kernel.
Following the intuition from the finite dimensional case, we can represent the Reparameterization Kernel as a functional gradient: Here, $\dot{C}_\Lambda(x, y)$ is playing the same role it did in the Polchinski equation as an inverse metric on the sample space of fields; however, we have not yet fixed its functional form, and, indeed, it may differ from Polchinski's choice. In the literature, $\dot{C}_\Lambda(x, y)$ is called the ERG kernel.
A popular choice for $\Sigma_\Lambda$ is given by the difference between the renormalized action of the theory and a second action, $\hat{S}_\Lambda$, called the seed action: The factor of two is conventional. In this framework, an RG scheme is therefore specified entirely by a choice of $\dot{C}_\Lambda$ and $\hat{S}_\Lambda$. For example, we can recover the Polchinski equation from the Wegner-Morris setup by taking:

Field Reparameterization and Scheme Independence
In Quantum Field Theory, fields are not observable quantities but rather a device used to encode a theory. In this respect, we should regard field parametrization as a choice of coordinates on the sample space of the theory and require that the theory be invariant under diffeomorphisms of these coordinates. Following this line of thought, the author of [3] showed that the field reparameterization appearing in (11) can be interpreted as a gauge redundancy, and, as a consequence, the drift component of the Wegner-Morris equation can be interpreted as a choice of gauge. This perspective can be made concrete by viewing the convective derivative of the resulting flow equation as a covariant derivative, with the vector field generating the drift playing the role of a gauge field. Changing the drift field changes the description of the ERG only cosmetically; in particular, the expectation values of relevant operators are left invariant under a change of gauge. We shall adopt this perspective and regard the prescription of drift in an ERG as tantamount to a choice of scheme.

Wegner-Morris and Fokker-Planck
In the previous subsection, we saw that ERG can be understood as a functional differential equation associated with the Wegner-Morris Equation (9). In this section, we shall demonstrate how the Wegner-Morris equation may be regarded as a form of the Fokker-Planck equation. For simplicity, we shall work in the context of probability theory on a differentiable sample space, $S$.
Given a function $F : M \to \mathbb{R}$ on a (pseudo-)Riemannian manifold $M$ with metric $g$, we define the gradient as follows: for any path $\gamma : [0, 1] \to M$, with $\gamma(0) = x$, the gradient of $F$ at $x$ with respect to the metric $g$ is the tangent vector $\mathrm{grad}_g F(x)$ such that: $$\frac{d}{dt} F(\gamma(t))\Big|_{t=0} = g\big(\mathrm{grad}_g F(x), \dot{\gamma}(0)\big).$$ An equivalent definition that does not require the introduction of a path simply defines the gradient in terms of the exterior derivative on $M$: $$dF(X) = g(\mathrm{grad}_g F, X); \qquad \mathrm{grad}_g F = (dF)^\sharp. \tag{16}$$ Here, the map $\sharp : T^*M \to TM$ corresponds to the usual notion of index raising via the metric, mapping one-forms into vectors. To be precise, given $\alpha, \beta \in T^*M$ we have: Let $M = \mathrm{dens}(S)$ denote the manifold of probability distribution functions on the sample space $S$. The tangent space to $M$ at a point $p$ is defined by: This is a good definition since functions $\eta \in TM$ can be regarded as perturbations to probability densities that do not spoil the integral normalization ($\star$ is the Hodge star on $S$): Hence, perturbations $p \to p + \eta$ still belong to the manifold $M$.
Now, as a first exercise in understanding gradients on $M$, let us consider the most straightforward combination of metric and scalar function on this manifold. $M$ is naturally equipped with an $\ell^2$ metric of the form: Moreover, there is a natural functional to consider, namely, the Dirichlet Energy Functional $E : M \to \mathbb{R}$ defined by ($d$ is the exterior derivative on $S$, not on $M$): The exterior derivative on $M$ should be understood as the variational derivative with respect to the density $p$. Thus, by standard techniques, we can compute: A probability distribution must decay at the boundary of the sample space in order to satisfy an integral normalization condition; hence, in this and any future computations we can safely take the boundary variation to zero. Thus, we arrive at the desired result: Matching this to the definition of the gradient (16), we conclude that the gradient of the Dirichlet Energy Functional with respect to the $\ell^2$ metric on $M$ is equivalent to the Laplacian of $p$: An immediate corollary of this fact is that it allows us to write the standard heat equation, $\frac{\partial}{\partial t} p_t = \Delta p_t$, in the form of a gradient flow as: Remarkably, if we choose a different metric for the space of probability distributions $M$, we can recover the heat equation as a gradient flow with respect to a different functional of the probability distribution. Of particular interest to us is the task of constructing a metric for which the potential is the differential entropy: To find such a metric, we begin by establishing the following isomorphism of $TM$: given $\eta \in T_p M$ and $\hat{\eta} \in C^\infty(S)$, the following equation implicitly defines an isomorphism between them, up to a constant: In more familiar vector calculus notation, this equation takes the form: Using this isomorphism, we may now define a new metric on $M$, which is essentially a probability weighted version of the Dirichlet Energy Functional. In particular, we take: As our notation suggests, this metric can be understood as the infinitesimal form of the Wasserstein two distance defined in Appendix A. A proof of this fact can be found in [4].
Now, let us compute the exterior derivative of the differential entropy, and by extension its gradient with respect to the newly minted Wasserstein metric (28). Let $\eta = \delta p$ denote an element of $TM$ obtained by perturbing the density $p$. Moreover, let $\hat{\eta}$ denote the element isomorphic to $\eta$ via (26). We may now complete the desired computation: Hence, we have succeeded in showing that: meaning that we can indeed write the heat equation: In general, if we specify a functional, $F : M \to \mathbb{R}$, and follow the approach described above, the resulting gradient flow equation will read: For our purposes, it will be interesting to consider functionals that are the sum of two pieces: Here, $S[p]$ is the differential entropy; thus, this part of the functional will source a diffusion equation in (37). The second term should be interpreted as a potential and will introduce drift into (37). Provided $V : S \to \mathbb{R}$ is a function that does not depend on $p$, (37) takes the form of a convection-diffusion equation: This is precisely the Fokker-Planck equation. In vector calculus notation, it takes the form: If the normalization $Z = \int_S e^{-V}\,\mathrm{Vol}_S$ is finite, $q = \frac{1}{Z} e^{-V}$ is a probability distribution, which is the stationary state of (39). In this case, the gradient flow of $F$ can equivalently be understood as the gradient flow of the KL-Divergence between $p_t$ and $q$. This follows from a simple calculation: Because $\ln(Z)$ is independent of the distribution $p$, the variational derivatives of $F$ and $D_{KL}(p_t \,\|\, q)$ will be equal: $\frac{\delta F}{\delta p} = \frac{\delta D_{KL}(p \| q)}{\delta p}$. Thus, both of these functionals give rise to the same gradient flow (37).
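The relaxation of a Fokker-Planck flow toward the stationary state $q = \frac{1}{Z} e^{-V}$, together with the accompanying decrease of the KL-Divergence, can be illustrated numerically. The following is only a minimal sketch under stated assumptions: it takes the one-dimensional form $\partial_t p = \partial_x (p\,\partial_x V) + \partial_x^2 p$, the illustrative potential $V(x) = x^2/2$, and a simple conservative finite-difference scheme on a truncated grid. The function name `fokker_planck_kl`, the grid, the time step, and the initial density are all hypothetical choices made for this example.

```python
import math


def fokker_planck_kl(n=81, half_width=4.0, steps=2000, dt_factor=0.2):
    """Evolve dp/dt = d/dx(p dV/dx) + d^2p/dx^2 for V(x) = x^2/2 (assumed form).

    Returns the history of D_KL(p_t || q) and the final total probability.
    """
    dx = 2.0 * half_width / (n - 1)
    xs = [-half_width + i * dx for i in range(n)]

    # Stationary density q = exp(-V) / Z, normalised on the grid.
    weights = [math.exp(-0.5 * x * x) for x in xs]
    z = sum(weights) * dx
    q = [w / z for w in weights]

    # Illustrative initial density: a narrow Gaussian bump centred at x = 1.5.
    p = [math.exp(-((x - 1.5) ** 2) / 0.5) for x in xs]
    norm = sum(p) * dx
    p = [pi / norm for pi in p]

    # Symmetric interface weights; the flux is J = q * d/dx (p / q).
    q_half = [math.sqrt(q[i] * q[i + 1]) for i in range(n - 1)]
    dt = dt_factor * dx * dx  # small enough for this explicit scheme

    def kl(density):
        return sum(
            pi * math.log(pi / qi) for pi, qi in zip(density, q) if pi > 0.0
        ) * dx

    history = [kl(p)]
    for _ in range(steps):
        ratio = [pi / qi for pi, qi in zip(p, q)]
        flux = [q_half[i] * (ratio[i + 1] - ratio[i]) / dx for i in range(n - 1)]
        updated = []
        for i in range(n):
            right = flux[i] if i < n - 1 else 0.0  # zero flux at the right edge
            left = flux[i - 1] if i > 0 else 0.0   # zero flux at the left edge
            updated.append(p[i] + dt * (right - left) / dx)
        p = updated
        history.append(kl(p))
    return history, sum(p) * dx


if __name__ == "__main__":
    kl_history, mass = fokker_planck_kl()
    print("initial KL:", kl_history[0])
    print("final KL:  ", kl_history[-1])
    print("total mass:", mass)
```

Because the update is written in flux form with zero flux at the edges, total probability is conserved by construction, mirroring the statement that the flow only redistributes probability.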
We are now prepared to make good on our promise and show that the Wegner-Morris equation is equivalent to the Fokker-Planck equation with a specified stationary distribution. To this end, let us write: where $Z_d = \int_S \star d$ is the normalization factor for the Boltzmann weight of the distribution $d$, i.e., $p = e^{-S_p}$. Let us also define: which are the analogs of the renormalization scheme function and the associated reparameterization kernel. Now, we need only compute the variation of the KL-Divergence: Hence, we have shown that: Thus, the Fokker-Planck equation associated with the gradient flow of $F$ is precisely the Finite Dimensional Wegner-Morris equation:

Renormalization Group Flow as Optimal Transport
Relating RG flow to the Continuity Equation (53) is more or less an exercise in properly identifying the sample space of the RG probability distribution and completing the analogies that follow from there. For simplicity, we have reproduced this exercise in the form of a dictionary between optimal transport for finite sample spaces and ERG, which can be found in Table 2.
Table 2. A dictionary describing the correspondence between Optimal Transport and ERG.

| | Optimal Transport | Exact Renormalization Group |
| Sample Space | $S$, a differentiable manifold | $F(M)$, the space of fields on spacetime |

From Stochastic Differential Equations to Partial Differential Equations and Back Again
Having placed the Exact Renormalization Group flow equations squarely in the context of Partial Differential Equations, we would now like to take a brief detour to address some of the generic properties of these equations. In particular, we will be interested in understanding the relationship between continuity equations on the one hand and Stochastic Differential Equations on the other. For more information on the relationship between Stochastic Differential Equations, partial differential equations, and optimal transport, see [14][15][16][17][18][19][20][21].

Stochastic Differential Equations
A stochastic process is a one-parameter family of random variables, $\{X_t\}_{t \geq 0}$. For our purposes, we shall be interested in stochastic processes whose dynamics are governed by a Stochastic Differential Equation of the Ito form [22][23][24]: $$dX^i_t = m^i(X_t, t)\,dt + \sigma^i{}_j(X_t, t)\,dW^j_t. \tag{54}$$ Here, $m^i$ describes the drift, or mean rate of change, while $\sigma^i{}_j$ describes the diffusion, or mean variability. $dW^i_t$ corresponds to an independent increment of a Brownian motion, a random variable drawn from a normal distribution with zero mean and variance $dt$, corresponding to the noisy motion of the process over a single increment of time. (We recall that the Ito formulation is one of two major approaches to the subject of stochastic calculus, with the other being the Stratonovich formalism. Here, we use the Ito formulation because it makes the relationship between Stochastic Differential Equations and Partial Differential Equations very clear.)
Physicists may find it useful to compare (54) to the Langevin Equation, $$\frac{dX^i_t}{dt} = \mu^i(X_t, t) + \eta^i_t,$$ which describes the dynamics of a quantity $X_t$ governed by a law $\frac{dX}{dt} = \mu$ but subject to random fluctuations, $\eta$ [25,26]. To match the Stochastic Differential Equation, we should take $\eta_t = (\eta^1_t, ..., \eta^n_t)$ to be a random variable with the correlation structure $$\langle \eta^i_t \rangle = 0, \qquad \langle \eta^i_t\, \eta^j_s \rangle = \delta^{kl}\,\sigma^i{}_k\,\sigma^j{}_l\,\delta(t - s).$$ Given a function $f : [0, T] \times S \to \mathbb{R}$, we can determine its stochastic differential when evaluated over the process $X_t$ by using the principle of quadratic variation. Quadratic variation dictates that the product of two increments of Brownian motion scales linearly in the interval between realizations of the stochastic process. This fact follows from the observation that the variance of a Wiener process is linear in time. Schematically, we encode the principle of quadratic variation in the form: $$dW^i_t\, dW^j_t = \delta^{ij}\,dt.$$ If we now expand the differential of $f(X_t, t)$ in a power series, retaining terms up to $O(dt)$, we find: $$df(X_t, t) = \left(\frac{\partial f}{\partial t} + m^i \frac{\partial f}{\partial x^i} + \frac{1}{2}\,\delta^{kl}\,\sigma^i{}_k\,\sigma^j{}_l\,\frac{\partial^2 f}{\partial x^i \partial x^j}\right) dt + \sigma^i{}_j\,\frac{\partial f}{\partial x^i}\,dW^j_t. \tag{58}$$ The second-order derivative terms are the unique addition provided to us by quadratic variation. The realizations, $\{f(X_t, t)\}_{t \geq 0}$, now define a stochastic process in their own right, governed by the Stochastic Differential Equation (58).
A stochastic process, $\{Y_t\}_{t \geq 0}$, is called a Martingale if it satisfies the equation $$\mathbb{E}\big[Y_t \mid \{Y_\tau\}_{\tau \leq s}\big] = Y_s, \qquad s \leq t. \tag{59}$$ One can read (59) as specifying that the mean value of the stochastic process, $Y_t$, has no tendency to change over time. Inspired by this interpretation, it is not hard to show that a stochastic process will be a Martingale if and only if it is described by a Stochastic Differential Equation with vanishing drift. This leads to a beautiful connection between the theory of stochastic processes and harmonic analysis [27][28][29]. Considering the stochastic process $\{f(X_t, t)\}_{t \geq 0}$, if we demand that the drift in (58) be set equal to zero, we find that $f$ must satisfy the Partial Differential Equation: $$\frac{\partial f}{\partial t} + m^i \frac{\partial f}{\partial x^i} + \frac{1}{2}\,\delta^{kl}\,\sigma^i{}_k\,\sigma^j{}_l\,\frac{\partial^2 f}{\partial x^i \partial x^j} = 0. \tag{60}$$ Moreover, using (59), we see that this partial differential equation has a formal solution: $$f(x, t) = \mathbb{E}\big[f(X_T, T) \mid X_t = x\big], \tag{61}$$ where $T$ should be regarded as the terminal time of the stochastic process.
Let us define the differential operator that appears in (60) as: $$A = m^i \frac{\partial}{\partial x^i} + \frac{1}{2}\,\delta^{kl}\,\sigma^i{}_k\,\sigma^j{}_l\,\frac{\partial^2}{\partial x^i \partial x^j}. \tag{62}$$ With respect to the $\ell^2$ inner product on the space of functions, we can define a formal adjoint operator, $A^\dagger$, such that $$\int_S g\,(A f)\,d^n x = \int_S (A^\dagger g)\,f\,d^n x.$$ The adjoint operator can be deduced by integrating by parts, and we find: $$A^\dagger g = -\frac{\partial}{\partial x^i}\big(m^i g\big) + \frac{1}{2}\,\frac{\partial^2}{\partial x^i \partial x^j}\big(\delta^{kl}\,\sigma^i{}_k\,\sigma^j{}_l\,g\big).$$ The differential equation generated by the adjoint operator therefore reads: $$\frac{\partial p_t}{\partial t} = -\frac{\partial}{\partial x^i}\big(m^i p_t\big) + \frac{1}{2}\,\frac{\partial^2}{\partial x^i \partial x^j}\big(\delta^{kl}\,\sigma^i{}_k\,\sigma^j{}_l\,p_t\big).$$ This is the Fokker-Planck equation. In fact, it is slightly more general than the Fokker-Planck equation we uncovered through our analysis in Section 2.4 because it allows for a stochastic process with non-trivial diffusion matrices $\sigma^i{}_j$. To recover the Fokker-Planck Equation (39), we should consider the Stochastic Differential Equation: Here, we again see the role of the potential $V$ in sourcing the mean drift behavior of the stochastic process. That is, the stochastic behavior associated with the random variable described by a probability distribution solving the Fokker-Planck equation with potential function $V$ is itself a stochastic gradient flow with respect to that potential. The pair of differential equations we have written have interpretations as "Forward" and "Backward" processes. The equation generated by $A$: $$\frac{\partial f}{\partial t} + A f = 0$$ is called the Backward equation because its solution, i.e., (61), is specified by a terminal condition. On the contrary, the adjoint equation: $$\frac{\partial f}{\partial t} = A^\dagger f$$ is called the Forward equation because its solution is specified by an initial condition: $f(0, x)$.
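The principle of quadratic variation can be checked numerically. The sketch below, written with only the Python standard library, sums the squared increments of a simulated one-dimensional Brownian path over $[0, T]$ and compares the result to $T$. The function name `quadratic_variation`, the step count, and the seed are hypothetical choices for this illustration.

```python
import math
import random


def quadratic_variation(total_time=1.0, n_steps=10_000, seed=0):
    """Sum of squared Brownian increments over [0, total_time]."""
    rng = random.Random(seed)
    dt = total_time / n_steps
    increment_sd = math.sqrt(dt)  # each increment dW has variance dt
    return sum(rng.gauss(0.0, increment_sd) ** 2 for _ in range(n_steps))


if __name__ == "__main__":
    # The sum should concentrate around total_time as n_steps grows.
    print(quadratic_variation(total_time=1.0))
    print(quadratic_variation(total_time=2.0, seed=1))
```

The fluctuations of the sum around $T$ shrink like $T\sqrt{2/n}$ for $n$ steps, which is why the squared increment can be treated as the deterministic quantity $dt$ in the continuum limit.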
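The formal solution of the backward equation as a conditional expectation over the terminal value can be illustrated with a small Monte Carlo experiment. This sketch assumes the simplest case of pure Brownian motion, $dX_t = dW_t$ (zero drift, unit diffusion), for which the function $x^2 + (T - t)$ solves $\partial_t f + \frac{1}{2}\partial_x^2 f = 0$ with terminal data $x^2$. The names `terminal_expectation` and `exact_solution`, as well as the sample sizes, are hypothetical.

```python
import math
import random


def terminal_expectation(x, t, terminal_time, payoff,
                         n_paths=40_000, n_steps=10, seed=0):
    """Monte Carlo estimate of E[payoff(X_T) | X_t = x] for dX = dW."""
    rng = random.Random(seed)
    dt = (terminal_time - t) / n_steps
    increment_sd = math.sqrt(dt)
    total = 0.0
    for _path in range(n_paths):
        position = x
        for _step in range(n_steps):
            position += rng.gauss(0.0, increment_sd)
        total += payoff(position)
    return total / n_paths


def exact_solution(x, t, terminal_time):
    """f(x, t) = x^2 + (T - t) solves df/dt + (1/2) d^2f/dx^2 = 0, f(x, T) = x^2."""
    return x * x + (terminal_time - t)


if __name__ == "__main__":
    estimate = terminal_expectation(0.5, 0.25, 1.0, lambda y: y * y)
    print(estimate, exact_solution(0.5, 0.25, 1.0))
```

The agreement between the two numbers reflects the Martingale property: evaluated along the process, the solution of the backward equation has no tendency to drift.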
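The interpretation of the potential $V$ as sourcing the drift of a stochastic gradient flow can be made concrete with an Euler-Maruyama simulation. The sketch below assumes the one-dimensional normalization $dX_t = -V'(X_t)\,dt + \sqrt{2}\,dW_t$, which is the choice whose forward equation is $\partial_t p = \partial_x(p\,V') + \partial_x^2 p$; this normalization, the quadratic potential $V(x) = x^2/2$, and the names `euler_maruyama` and `mean_and_variance` are assumptions made for illustration only.

```python
import math
import random


def euler_maruyama(x0, grad_v, total_time=1.0, n_steps=100,
                   n_paths=4000, seed=0):
    """Simulate dX = -grad_v(X) dt + sqrt(2) dW and return the final positions."""
    rng = random.Random(seed)
    dt = total_time / n_steps
    noise_sd = math.sqrt(2.0 * dt)  # sqrt(2) dW has variance 2 dt
    finals = []
    for _path in range(n_paths):
        x = x0
        for _step in range(n_steps):
            x += -grad_v(x) * dt + rng.gauss(0.0, noise_sd)
        finals.append(x)
    return finals


def mean_and_variance(samples):
    """Sample mean and (population) variance of a list of floats."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, variance


if __name__ == "__main__":
    # V(x) = x^2 / 2, so grad V(x) = x (an Ornstein-Uhlenbeck process).
    finals = euler_maruyama(2.0, lambda x: x)
    print(mean_and_variance(finals))
```

For this potential the exact law at time $t$ is Gaussian with mean $x_0 e^{-t}$ and variance $1 - e^{-2t}$, so the samples drift toward the stationary density proportional to $e^{-V}$ as $t$ grows.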

Continuity Equations
Let us now begin from the reverse perspective and seek to understand how a continuity equation might be associated with a stochastic process. A measure on $S$ is a top form whose integral over all of $S$ can be normalized to 1. For convenience, we will refer to such a form as $\mu \in M_S$ with $$\int_S \mu = 1.$$ By Hodge duality, a measure can be related to a distribution: $p = \star\mu$, or $\mu = p\,\mathrm{Vol}_S$, which would bring us back into the notation used in the previous section. We will move back and forth between the measure-based and probability-density-based approaches more or less at will. To write down a continuity equation, we consider a one-parameter family of measures, or otherwise a trajectory through the space $M_S$, $\mu : [0, 1] \to M_S$, which we denote as $\mu_t(x) \in M_S$, governed by the differential equation: $$\frac{\partial \mu_t}{\partial t} + d\big(\iota_{v_t}\mu_t\big) = 0. \tag{70}$$ Here, $v_t : [0, 1] \to TS$ is a time-dependent vector field. Since $\mu_t$ is a top form, we can regard the second term above as the Lie Derivative and write: $$\frac{\partial \mu_t}{\partial t} + \mathcal{L}_{v_t}\mu_t = 0.$$ Thus, a continuity equation has the immediate interpretation as a flow generated by the vector field $v_t$ [19]. Compared to (39), we see that the vector field that generates a gradient flow with respect to the potential $F$ is given by: A measure $\mu_t$ that solves the continuity equation is referred to as a strong solution. However, we can also consider a weaker condition in which the integral of (70) against a compactly supported $C^1(S)$ function is always zero [30]. That is, given $\psi : [0, 1] \to C^1(S)$, we demand: $$\int_0^1 dt \int_S \psi_t \left(\frac{\partial \mu_t}{\partial t} + d\big(\iota_{v_t}\mu_t\big)\right) = 0. \tag{73}$$ Integrating by parts and using the fact that $\psi_t$ is compactly supported, (73) implies that $$\int_0^1 dt \int_S \left(\frac{\partial \psi_t}{\partial t} + d\psi_t(v_t)\right)\mu_t = 0. \tag{74}$$ To arrive at (74) we have also used the fact that $$d\psi_t \wedge \iota_{v_t}\mu_t = d\psi_t(v_t)\,\mu_t,$$ where $d\psi_t(v_t)$ is the pairing of $d\psi_t$ and $v_t$ in the sense of forms and tangent vectors. We can interpret (74) as the condition: Provided we can exchange the order of the $t$-derivative and the integral over $S$, we can reframe our analysis in terms of a $t$-independent function, $\phi : S \to \mathbb{R}$, in which case we find $$\frac{d}{dt}\int_S \phi\,\mu_t = \int_S d\phi(v_t)\,\mu_t. \tag{77}$$ This equation can be regarded as an Ehrenfest theorem specifying the mean dynamics of the field $\phi$. In particular, it says that $\phi$ will satisfy a gradient flow in the direction of $v_t$, and thus we can regard (77)
Here we have used that the Laplacian operator is defined on differential forms as: It is easy to show that the Laplacian of a top form, $\mu_t = p_t\,\mathrm{Vol}$, is related to the Laplacian of its associated scalar by: Hence, in this case (70) becomes: The fundamental solution to the heat equation with initial data $p_0$ is written formally in terms of the heat kernel [31,32]: $$p_t = e^{-t\Delta}\,p_0,$$ with $e^{-t\Delta}$ the operator obtained by exponentiating $\Delta$, which acts on the initial data. Indeed, it is easy to show that: In a position basis, we can represent the heat kernel as an integral kernel of the form: Thus, we can interpret the heat kernel as the transition function for a Markov process corresponding to the probability density of translation from sample point $x$ to sample point $y$ in an interval $2t$. This density takes the form of a multivariate Gaussian, which we shall denote by $H(x, y, t) = \mathcal{N}(y, 2tI)$ (here $\mathcal{N}(\mu, \Sigma)$ is the multivariate normal distribution with mean parameter $\mu$ and covariance matrix $\Sigma$).
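The Gaussian form of the heat kernel, with covariance $2tI$, can be checked directly against the heat equation. The following sketch assumes a one-dimensional flat sample space and the convention $\partial_t H = \partial_y^2 H$; it evaluates both sides by centered finite differences and also confirms that the kernel integrates to one. The function names and step sizes are hypothetical choices for this illustration.

```python
import math


def heat_kernel(x, y, t):
    """One-dimensional Gaussian transition density with mean x and variance 2t."""
    return math.exp(-((y - x) ** 2) / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)


def heat_residual(x, y, t, h=1e-3):
    """Finite-difference estimate of dH/dt - d^2H/dy^2 (should be close to 0)."""
    d_dt = (heat_kernel(x, y, t + h) - heat_kernel(x, y, t - h)) / (2.0 * h)
    d2_dy2 = (
        heat_kernel(x, y + h, t) - 2.0 * heat_kernel(x, y, t) + heat_kernel(x, y - h, t)
    ) / (h * h)
    return d_dt - d2_dy2


def total_mass(x, t, half_width=20.0, n=4001):
    """Riemann-sum approximation of the integral of H(x, y, t) over y."""
    dy = 2.0 * half_width / (n - 1)
    return sum(heat_kernel(x, -half_width + k * dy, t) for k in range(n)) * dy


if __name__ == "__main__":
    print(heat_residual(0.0, 0.3, 0.5))
    print(total_mass(0.0, 0.5))
```

The vanishing residual and unit mass are what allow the kernel to be read as a transition probability density for a Markov process.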
A general continuity equation will be generated by a differential operator $A$, through the equation: $$\frac{\partial p_t}{\partial t} = -A\,p_t.$$ We can play the same game as before and solve this differential equation formally by writing $p_t = e^{-tA} p_0$. Again, we can choose a continuous basis and express the heat kernel of the operator $A$ as an integral kernel: It remains natural to interpret $H(x, y; t)$ as the transition probability density for a Markov process [33,34]. The Markov process in question is defined by a continuous set of operators, $\{P_t\}_{t \geq 0}$. One can regard the operator $P_t$ as the "time evolution operator" associated with $A$, which, in this context, should be understood as the infinitesimal generator of time translations. In other words, $P_t$ is obtained by exponentiating the operator $A$, $P_t = e^{-tA}$.
Given any measurable function, $f : S \to \mathbb{R}$, the action of the operator $P_t$ translates $f$ along the Markov chain. If we work in the position basis, we can write: $$(P_t f)(x) = \int_S H(x, y; t)\,f(y)\,\mathrm{Vol}_S(y),$$ which is simply the convolution of the transition density $H$ with $f$. This provides us with a very useful formula for determining the operator $A$: $$A f = -\frac{d}{dt} P_t(f)\Big|_{t=0}. \tag{87}$$ Equation (87) reverses the operator exponentiation by differentiating along the integral curve $P_t(f)$.
The family of operators $\{P_t\}_{t \geq 0}$ can be regarded as a semigroup with the simple composition law [35]: $$P_s \circ P_t = P_{s+t}.$$ Moreover, the set of operators $\{P_t\}_{t \geq 0}$ defines a stochastic process $\{X_t\}_{t \geq 0}$, satisfying the law [17]: This expression is equivalent to the Martingale condition (61). Writing the operator $A$ in the form (62), the stochastic process $\{X_t\}_{t \geq 0}$ is immediately identified with the Stochastic Process: Thus, we have succeeded in mapping our way back to a Stochastic Differential Equation, this time beginning from a diffusion equation.
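The semigroup composition law can be verified numerically for the simplest example, the one-dimensional Gaussian heat kernel with variance $2t$. The sketch below approximates the integral over the intermediate point by a Riemann sum and compares it with the kernel evaluated at the total time. The kernel normalization, the function names, and the integration grid are assumptions made for this illustration.

```python
import math


def heat_kernel(x, y, t):
    """One-dimensional Gaussian transition density with mean x and variance 2t."""
    return math.exp(-((y - x) ** 2) / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)


def compose(x, y, s, t, half_width=20.0, n=4001):
    """Riemann-sum approximation of the integral over z of H(x, z, s) H(z, y, t)."""
    dz = 2.0 * half_width / (n - 1)
    total = 0.0
    for k in range(n):
        z = -half_width + k * dz
        total += heat_kernel(x, z, s) * heat_kernel(z, y, t)
    return total * dz


if __name__ == "__main__":
    # Two successive transitions of duration 0.3 and 0.7 versus one of duration 1.0.
    print(compose(0.0, 0.5, 0.3, 0.7))
    print(heat_kernel(0.0, 0.5, 1.0))
```

In kernel language this is the Chapman-Kolmogorov relation: evolving for time $s$ and then for time $t$ is the same as evolving once for time $s + t$.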

Dynamical Bayesian Inference and the Backward Equation
The formulation of the Exact Renormalization Group flow as an Optimal Transport problem, particularly in its relationship to the extremization of relative entropy, already suggests a deep connection between renormalization, diffusion, and information theory. We now turn our attention to the task of fleshing out this relationship and, in doing so, providing an information-theoretic conceptualization of renormalization through the intermediary of statistical inference. To accomplish this task, we will build on the language of Dynamical Bayesian Inference introduced in [6]. We will show that the dynamics of an inferred probability model defined by a continuously updating Bayesian scheme give rise to an inverse process governed by a diffusion equation that can be brought into correspondence with ERG. Using this fact, we advocate for the perspective that renormalization can be understood as the inverse process of statistical inference.
We should emphasize that inversion is understood in the statistical sense as a 'reverse time' stochastic process relative to the Kolmogorov forward process defined by ERG [36]. Such stochastic processes have recently appeared in diffusion learning, where the reverse time Stochastic Differential Equation relative to a forward diffusion process is modeled via a score-based generative algorithm [37]. From the point of view developed in this section, a score-based generative algorithm can be interpreted as implementing a form of Bayesian inference. We will revisit this and other related points in Section 6.

Bayesian Inversion
Bayesian inference is a probabilistic, self-consistent approach to adjusting one's beliefs in the presence of new information. It originates from Bayes' Law, which encodes the probabilistic relationship between two, potentially co-varying, random variables H ∈ S_H and E ∈ S_E:
$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}. \tag{91}$$
In Bayesian Inference, the variables H and E are respectively identified with the Hypothesis and the Evidence. In this language, one can attach a compelling geometric interpretation to Bayes' law as measuring the relative volumes of the region H ⊂ S_H × S_E, for which both H and E are realized, and the region O ⊂ S_E, for which the observed evidence is true. Thus, the interpretation of Bayesian Inference is that it measures the volume of hypotheses that is consistent with a given set of observed data, which we denote as the region H | O ⊂ S_H × S_E.
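The volume interpretation of Bayes' law can be made concrete on a finite sample space. The following Python sketch is purely illustrative: the joint weights and the labels `rain`, `dry`, `wet_ground`, and `dry_ground` are hypothetical, chosen only to show the posterior as the ratio of the volume of the region where both the hypothesis and the evidence hold to the volume of the region where the evidence holds.

```python
# Hypothetical joint weights ("volumes") on S_H x S_E; they sum to one.
joint = {
    ("rain", "wet_ground"): 0.27,
    ("rain", "dry_ground"): 0.03,
    ("dry", "wet_ground"): 0.07,
    ("dry", "dry_ground"): 0.63,
}


def marginal_evidence(joint, evidence):
    """Volume of the region O in which the observed evidence is true."""
    return sum(w for (h, e), w in joint.items() if e == evidence)


def posterior(joint, hypothesis, evidence):
    """P(H | E): volume where both H and E hold, relative to the volume of O."""
    return joint[(hypothesis, evidence)] / marginal_evidence(joint, evidence)


p_rain_given_wet = posterior(joint, "rain", "wet_ground")
p_dry_given_wet = posterior(joint, "dry", "wet_ground")
```

Conditioning on the evidence restricts attention to the region O and renormalizes the volumes inside it, so the two posteriors above sum to one.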
A particularly interesting form of Bayesian Inference arises in the context of so-called Bayesian Inversion problems [5,38-43]. There, we are concerned with a pair of random variables, Y ∈ S_Y and U ∈ S_U, which are related by the following process:
$$Y = G(U) + N.$$
We regard G : S_U → S_Y as a deterministic function, and N ∈ S_N ≃ S_Y denotes noise, which can either be associated with the process itself or the process of measuring its output.
The goal of the Bayesian Inversion problem is to determine the signal, u, that led to the measured output, y. However, it is often impractical to truly invert the process; thus, it is more prudent to seek to place a probability distribution over the inputs.
To this end, we regard U ∈ S_U as having a prior measure Π_0 = π_0(u) Vol_U(u) ∈ M(S_U) and treat the noise N as a random variable independent of U, also possessing a measure ρ_0 = p_0(n) Vol_N(n) ∈ M(S_N). Then, we can define the conditional random variable Y | U as:
$$Y \mid U = G(u) + N.$$
If we regard ϕ : S_Y → S_N as a map from output to noise, such that
$$\phi(y) = y - G(u),$$
we can obtain a conditional measure for Y | U by pulling back the measure ρ_0 via the map ϕ:
$$\phi^{*}\rho_0 = p_0(y - G(u))\, \mathrm{Vol}_Y(y).$$
We regard the pulled-back density p_0(y − G(u)) as the likelihood function for an inference problem and often denote it by p_{Y|U}(y | u). We therefore obtain a joint measure:
$$p_0(y - G(u))\, \pi_0(u)\, \mathrm{Vol}_Y(y) \wedge \mathrm{Vol}_U(u)$$
for the random variable (Y, U) ∈ S_Y × S_U. Provided the marginalization:
$$p_Y(y) = \int_{S_U} p_0(y - G(u))\, \pi_0(u)\, \mathrm{Vol}_U(u)$$
is greater than zero and finite, we can define the Bayesian posterior measure as:
$$\Pi^{y}(u) = \frac{1}{p_Y(y)}\, p_0(y - G(u))\, \pi_0(u)\, \mathrm{Vol}_U(u),$$
which is just a recapitulation of Bayes' law as written in the form (91). The notation Π^{y*}(u) is meant to remind us that the posterior distribution depends on the realization of observed data y* ∈ S_Y. We can also recognize p_Y(y) as the marginal density for the random variable Y modulo the prior distribution π_0. Next, we define the Bayesian Potential as:
$$\Phi(u; y) = -\ln p_0(y - G(u)),$$
which is nothing but the negative log likelihood. With this potential in hand, we can write the Bayesian Inference condition suggestively as:
$$\frac{d\Pi^{y}}{d\Pi_0}(u) = \frac{e^{-\Phi(u; y)}}{p_Y(y)}.$$
On the left hand side, we have the Radon-Nikodym derivative of the posterior measure with respect to the prior measure. On the right hand side, we have what we would like to interpret as the stationary distribution modulo observations y. This equation can be interpreted as follows: given any measurable function f : S_U → R,
$$\mathbb{E}_{\Pi^{y}}\big[f(U)\big] = \mathbb{E}_{\Pi_0}\left[\frac{e^{-\Phi(U; y)}}{p_Y(y)}\, f(U)\right],$$
meaning that the expectation value with respect to the posterior measure is the same as the expectation value with respect to the prior provided the function is augmented by the evidence in the form of the stationary distribution e^{−Φ(u;y)}/p_Y(y).
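A minimal numerical sketch of the Bayesian Inversion construction may help fix ideas. It discretizes S_U on a uniform grid and checks that posterior expectations coincide with prior expectations reweighted by e^{−Φ}/p_Y. The forward map `forward_map` (a cubic), the standard normal prior, the Gaussian noise level, and the observed value `y_obs` are all assumptions made for the example.

```python
import math


def forward_map(u):
    """Hypothetical deterministic process G(u); any smooth function would do."""
    return u ** 3


def bayesian_potential(u, y, noise_sigma):
    """Phi(u; y) = -ln p_0(y - G(u)) for Gaussian noise N(0, noise_sigma^2)."""
    residual = y - forward_map(u)
    return (residual ** 2 / (2.0 * noise_sigma ** 2)
            + 0.5 * math.log(2.0 * math.pi * noise_sigma ** 2))


def standard_normal_density(u):
    """Assumed prior density pi_0(u): a standard normal, truncated to the grid."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


# Uniform grid on S_U = [-3, 3]; integrals become Riemann sums with weight du.
grid = [-3.0 + 6.0 * k / 2000 for k in range(2001)]
du = grid[1] - grid[0]
y_obs, noise_sigma = 1.0, 0.5

prior = [standard_normal_density(u) for u in grid]
likelihood = [math.exp(-bayesian_potential(u, y_obs, noise_sigma)) for u in grid]

# Marginal density p_Y(y) and posterior density e^{-Phi} * pi_0 / p_Y.
p_y = sum(l * p for l, p in zip(likelihood, prior)) * du
posterior = [l * p / p_y for l, p in zip(likelihood, prior)]


def f(u):
    """An arbitrary test function on S_U."""
    return u * u


# Posterior expectation of f versus prior expectation of f * e^{-Phi} / p_Y.
posterior_expectation = sum(f(u) * w for u, w in zip(grid, posterior)) * du
reweighted_prior_expectation = sum(
    f(u) * (l / p_y) * p for u, l, p in zip(grid, likelihood, prior)
) * du
```

On a grid the two expectations agree by construction; the point of the sketch is only to make the roles of the potential Φ, the marginal p_Y, and the Radon-Nikodym factor explicit.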

Dynamical Bayes
Dynamical Bayesian Inference is an extension of the conventional approach to Bayesian Inference in which one implements Bayesian inference as an iterative process where swathes of evidence are collected, and the posterior distribution at the end of a given iteration is used as the prior distribution in the following iteration [6].
Using such an approach, one can regard Bayesian Inference as a dynamical system governed by a first-order differential equation. To begin, we define a timelike variable, T, which essentially corresponds to the total number of data points observed. Inference up to the "time" T therefore leverages evidence from the set of observed data {y_t}_{0≤t≤T}, which we can regard as a continuous time stochastic process coming from the data-generating measure µ*_Y = p*_Y(y) Vol_Y(y). In what follows we shall regard the data-generating measure as belonging to the parametric family p_{Y|U}, associated with the true underlying signal value u*. That is, p*_Y(y) = p_{Y|U}(y | u*). The equation satisfied by π_T(u) is then given by:
$$\frac{\partial \pi_T(u)}{\partial T} = -\pi_T(u)\left[ D_{KL}\big(p^{*}_Y \,\|\, p_{Y|U}(\cdot \mid u)\big) - \int_{S_U} \pi_T(u')\, D_{KL}\big(p^{*}_Y \,\|\, p_{Y|U}(\cdot \mid u')\big)\, \mathrm{Vol}_U(u') \right], \tag{102}$$
where
$$D_{KL}\big(p^{*}_Y \,\|\, p\big) = \int_{S_Y} p^{*}_Y(y)\, \ln\frac{p^{*}_Y(y)}{p(y)}\, \mathrm{Vol}_Y(y)$$
is the KL-Divergence between a distribution p on S_Y and the data-generating distribution p*_Y. As an aside, we note that (102) is equivalent to the replicator dynamic which appears in evolutionary game theory provided the fitness of a particular model p_{Y|U}(y | u) is taken to be minus its KL-Divergence with the data-generating model. We refer the reader to [44,45] for more detail.
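Among the most significant insights of this approach is that, if we consider models that are within an ϵ neighborhood of the true underlying signal, the 2n-point functions described by Π_T are very well approximated by power laws:
$$\big\langle (u - u_{*})^{i_1} \cdots (u - u_{*})^{i_{2n}} \big\rangle_{\Pi_T} \approx \frac{1}{T^{n}} \sum_{\text{pairings}} I^{i_{a_1} i_{b_1}} \cdots I^{i_{a_n} i_{b_n}}. \tag{104}$$
Here, I^{ij} is the inverse of the Fisher Metric arising from the family of distributions p_{Y|U}, evaluated at the data-generating parameter u*.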
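The flow described in this section can be explored numerically. The sketch below assumes the replicator form stated in the text, in which the rate of change of π_T is −π_T times the difference between the KL-divergence and its posterior average, and integrates it with explicit Euler steps for a one-parameter normal model. The grid, the flat prior, the step size, and the values of `u_star` and `sigma` are assumptions of the example. The result is compared against the normalized closed form π_0 e^{−T D}/Z_T.

```python
import math


def kl_normal(u, u_star, sigma):
    """D_KL(u* || u) for two normal models with common variance sigma^2."""
    return (u - u_star) ** 2 / (2.0 * sigma ** 2)


def euler_step(pi, divergence, du, dT):
    """One explicit Euler step of d(pi)/dT = -pi * (D - <D>_pi)."""
    mean_divergence = sum(p * d for p, d in zip(pi, divergence)) * du
    return [p - dT * p * (d - mean_divergence) for p, d in zip(pi, divergence)]


def closed_form(pi0, divergence, du, T):
    """Normalized solution pi_0 * exp(-T * D) / Z_T."""
    weights = [p * math.exp(-T * d) for p, d in zip(pi0, divergence)]
    z = sum(weights) * du
    return [w / z for w in weights]


# Grid over the parameter space and a flat (uninformed) prior.
grid = [-4.0 + 0.02 * k for k in range(401)]
du = 0.02
u_star, sigma = 0.5, 1.0
divergence = [kl_normal(u, u_star, sigma) for u in grid]
pi0 = [1.0 / (len(grid) * du)] * len(grid)

# Integrate the flow up to T = 2 and compare with the closed form.
T_final, dT = 2.0, 0.001
pi = pi0[:]
for _ in range(int(round(T_final / dT))):
    pi = euler_step(pi, divergence, du, dT)

exact = closed_form(pi0, divergence, du, T_final)
max_error = max(abs(a - b) for a, b in zip(pi, exact))
```

The subtraction of the posterior average keeps the discretized density normalized at every step, which is the role the text assigns to the u-independent factor in the formal solution.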

Bayesian Diffusion for Normal Data
To illustrate the Dynamical Bayesian Inference dynamic, it will be beneficial to work through the problem of performing Bayesian inference on the mean of normally distributed data with known variance, σ². In this case, we take:
$$y = \mu + n,$$
where n is noise distributed according to a distribution N(0, σ²):
$$p_0(n) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left(-\frac{n^2}{2\sigma^2}\right),$$
and G(u) = µ is a random draw from the distribution over means for the data y so that p_0(y − µ) = N(µ, σ²) (for notational simplicity, we shall denote this distribution by p(y | µ)). The data-generating distribution is taken to belong to the same parametric family of distributions, only with a fixed but unknown "true" underlying mean parameter, µ*. The governing equation of Dynamical Bayesian Inference can be solved formally as:
$$\pi_T(u) = \exp\big(-T\, D(u)\big)\, \exp\left(\int_0^T dT'\; \mathbb{E}_{\pi_{T'}}\big[D(U)\big]\right).$$
Here, we have used an abbreviated notation in which D(u) = D_KL(u* ∥ u) is the KL-divergence between two distributions of the same parametric form with parameter values u* and u. Notice that only the first exponential depends on the variable u; hence, we conclude that the role of the second exponential is simply to maintain the normalization of π_T as a probability distribution. Thus, we can write:
$$\pi_T(u) = \frac{1}{Z_T}\, e^{-T D(u)},$$
where
$$Z_T = \int_{S_U} e^{-T D(u)}\, \mathrm{Vol}_U(u).$$
In the case of the normal model with fixed variance, the KL-divergence is given by
$$D(\mu) = \frac{(\mu - \mu_{*})^2}{2\sigma^2}.$$
The T-Posterior is therefore given by:
$$\pi_T(\mu) = \frac{1}{Z_T}\, \exp\left(-\frac{T(\mu - \mu_{*})^2}{2\sigma^2}\right).$$
It is straightforward to determine the normalization of this distribution by performing the requisite integral that is now Gaussian. When all is said and done, the T-dependent posterior density is of the form:
$$\pi_T(\mu) = \sqrt{\frac{T}{2\pi\sigma^2}}\, \exp\left(-\frac{T(\mu - \mu_{*})^2}{2\sigma^2}\right). \tag{111}$$
Let us now compare (111) to the standard density of a length t increment of a Wiener process with diffusivity parameter σ:
$$f_{W_t}(x) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\, \exp\left(-\frac{x^2}{2\sigma^2 t}\right).$$
Recall that the density f_{W_t}(x) solves the heat equation:
$$\frac{\partial f_{W_t}}{\partial t} = \frac{\sigma^2}{2}\, \frac{\partial^2 f_{W_t}}{\partial x^2}.$$
Following the analysis of Section 3, one can also recognize the stochastic process {f_{W_t}(X_t)}_{t≥0} as a martingale adapted to the Wiener process. We therefore recognize (111) as describing a shifted Brownian motion for the mean parameter with diffusivity σ in the "time" parameter τ = 1/T:
$$\pi_\tau(\mu) = \frac{1}{\sqrt{2\pi\sigma^2\tau}}\, \exp\left(-\frac{(\mu - \mu_{*})^2}{2\sigma^2\tau}\right).$$
This observation lends credence to the idea that Bayesian inference can be associated with a diffusion process. Moreover, it provides an important insight: Bayesian Diffusion ensues backwards with respect to the performance of Bayesian Inference, in the timelike parameter, τ, which is the inverse of the Bayesian time, T, originally introduced.
Given the τ-posterior π_τ, and terminal data such as the parametric form of the data-generating distribution, one can obtain the τ path of the posterior predictive distribution for future data by marginalizing. For the normal model, this means:
$$p_\tau(y) = \int d\mu\; \pi_\tau(\mu)\, p_0(y - \mu),$$
which we can recognize as the convolution of π_τ and p_0(y) and interpret as the action of the Green Function of the Heat Operator translating the model forward in τ (and backward in T).
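Both claims of this subsection can be checked numerically in a few lines. The sketch below assumes, as the comparison with the Wiener density suggests, that the τ-posterior is a Gaussian centered on µ* with variance σ²τ. It then estimates the heat-equation residual by finite differences and evaluates the posterior predictive convolution by a Riemann sum, which for Gaussians should reproduce a normal density with variance σ²(1 + τ). The numerical values and step sizes are assumptions of the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_tau(mu, tau, mu_star, sigma):
    """Assumed tau-posterior: Gaussian centred on mu_star with variance sigma^2 * tau."""
    return normal_pdf(mu, mu_star, sigma ** 2 * tau)


def heat_residual(mu, tau, mu_star, sigma, h=1e-3):
    """Finite-difference estimate of d/dtau pi - (sigma^2 / 2) d^2/dmu^2 pi."""
    d_tau = (posterior_tau(mu, tau + h, mu_star, sigma)
             - posterior_tau(mu, tau - h, mu_star, sigma)) / (2.0 * h)
    d2_mu = (posterior_tau(mu + h, tau, mu_star, sigma)
             - 2.0 * posterior_tau(mu, tau, mu_star, sigma)
             + posterior_tau(mu - h, tau, mu_star, sigma)) / h ** 2
    return d_tau - 0.5 * sigma ** 2 * d2_mu


def posterior_predictive(y, tau, mu_star, sigma, half_width=10.0, steps=4000):
    """Riemann-sum convolution of pi_tau with the likelihood p_0(y - mu)."""
    d_mu = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        mu = mu_star - half_width + k * d_mu
        total += posterior_tau(mu, tau, mu_star, sigma) * normal_pdf(y, mu, sigma ** 2)
    return total * d_mu


mu_star, sigma, tau = 0.3, 1.2, 0.5
residual = heat_residual(0.9, tau, mu_star, sigma)
predictive = posterior_predictive(1.0, tau, mu_star, sigma)
expected = normal_pdf(1.0, mu_star, sigma ** 2 * (1.0 + tau))
```

The predictive variance σ²(1 + τ) grows with τ and shrinks toward σ² as T = 1/τ increases, in line with the statement that the heat kernel translates the model forward in τ and backward in T.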

Bayesian Drift and Scheme Independence
We have now shown that the solution to the Dynamical Bayesian inference equation, (111), is also the solution to the standard diffusion equation when viewed as transforming backwards relative to the update time, T. As we shall now discuss, we can promote (111) to a solution to a drift-diffusion equation by an analogous argument to the one that appeared in Section 2.3. In particular, we argue that drift in the context of Bayesian diffusion is also associated with a redundancy of description, in this case related to the specific sequence with which data are observed.
The reasoning behind this argument is most easily understood through an example. Suppose again that one is performing Bayesian inference on a system that is taken to follow a normal distribution with known variance, σ², but an unknown mean. To deduce the mean of the distribution governing the system, we observe a sequence of N independent, identically distributed random draws from the true underlying distribution, E = {Y_1, ..., Y_N}. Starting with a normal prior, one finds that the mean of the posterior distribution after observing the first n pieces of data shall be given by:
$$\mu(E_n) = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$
Let π ∈ Perm(N) be a permutation, and let π(E) = {Y_{π(1)}, ..., Y_{π(N)}} denote the same set of evidence but in a new order defined by π. We interpret this transformation as changing the sequence in which the data are obtained. Crucially, π(E) is still a set of N independent, identically distributed random variables drawn from the same data-generating distribution. If we compute the maximum likelihood estimate based on the first n pieces of data appearing in π(E), we find:
$$\pi(\mu(E_n)) = \frac{1}{n} \sum_{i=1}^{n} Y_{\pi(i)},$$
and hence it is true that π(µ(E_n)) ≠ µ(E_n). However, if we compute the maximum likelihood estimate on account of all of the data contained in either set, we find:
$$\pi(\mu(E_N)) = \frac{1}{N} \sum_{i=1}^{N} Y_{\pi(i)} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \mu(E_N).$$
That is, the terminal maximum likelihood estimate is invariant with respect to the sequence in which the data are incorporated into the model. We extrapolate this observation to the statement that the posterior distribution can be assigned an arbitrary path through the space of probability models, provided the terminal distribution remains consistent with the large observation limit, that is, the central tendency towards the data-generating distribution. This is completely analogous to the role of drift in defining an ERG scheme: the invariant definition of an ERG flow is given in terms of the IR fixed point it describes, and the path through the space of theories by which the theory moves from the UV to the IR is scheme-dependent.
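The sequencing freedom described above is easy to see in a small simulation. The sketch below draws a hypothetical set of evidence from a normal distribution (the seed, sample size, mean, and standard deviation are arbitrary choices), reorders it with one particular permutation (ascending order), and compares the running maximum likelihood estimates of the mean along the two orderings.

```python
import random


def running_means(data):
    """Maximum likelihood estimate of the mean after each new observation."""
    estimates, total = [], 0.0
    for n, y in enumerate(data, start=1):
        total += y
        estimates.append(total / n)
    return estimates


# N i.i.d. draws from a hypothetical data-generating normal distribution.
rng = random.Random(0)
evidence = [rng.gauss(1.5, 2.0) for _ in range(50)]

# One particular permutation of the same evidence: ascending order.
reordered = sorted(evidence)

path_original = running_means(evidence)
path_reordered = running_means(reordered)

# Intermediate estimates depend on the ordering; the terminal one does not.
paths_differ = any(
    abs(a - b) > 1e-6 for a, b in zip(path_original[:-1], path_reordered[:-1])
)
terminal_gap = abs(path_original[-1] - path_reordered[-1])
```

The two paths through the space of estimates are different, yet they end at the same point up to floating-point rounding, which is the sense in which the path is a matter of scheme while the endpoint is invariant.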
Operationally, we make use of the sequencing freedom in the Bayesian inference to "seed" the flow with information about the individual inference trajectory by specifying the τ path of the maximum a posteriori estimate (MAP). In the general case where we are given a set of signal parameters u ∈ S_U, we specify the trajectory of the MAP as a flow on the manifold S_U generated by a vector field V : S_U → TS_U. That is, such that or, in more standard notation, The trajectory of the MAP arises from maximizing the log-likelihood of the data-generating model. Thus, in many cases the MAP path, γ_τ, will realize a gradient descent: where I^{ij} are the matrix components of the inverse Fisher metric, and we have implemented the chain rule to evaluate the derivative. It is crucial to note that the gradient descent here ensues in the direction of increased data observation, that is, as T → ∞, γ approaches the data-generating parameter value, as desired.

Generic Bayesian Diffusion at Late T
The Normal Model is significant because it arises as the late T limit of any Dynamical Bayesian Inference scheme for which the parameters of the data-generating distribution are equal to some fixed values. This observation arises naturally as an asymptotic limit of the solution to (102) and is a statement of the Central Limit Theorem. At late T, the KL-Divergence can be approximated by the quadratic form:
$$D(u) \approx \frac{1}{2}\, I_{ij}\, (u - u_{*})^{i} (u - u_{*})^{j},$$
where I_{ij} is the Fisher metric evaluated at u*, meaning we can write the unnormalized posterior distribution as:
$$\pi_T(u) \propto \exp\left(-\frac{T}{2}\, I_{ij}\, (u - u_{*})^{i} (u - u_{*})^{j}\right).$$
Provided I does not depend on u, the normalization is obtained by performing a Gaussian integral, and we can write the posterior distribution as:
$$\pi_\tau(u) = \sqrt{\det\left(\frac{I}{2\pi\tau}\right)}\; \exp\left(-\frac{1}{2\tau}\, I_{ij}\, (u - u_{*})^{i} (u - u_{*})^{j}\right).$$
Here, we have changed the variables to τ in order to observe that this is the distribution of a multi-variate Brownian motion.
Using the gauge freedom discussed in the last section, we can promote this solution to one of the forms: where γ_τ is the trajectory of the MAP. As advertised, both (126) and (127) are consistent with the late T statistics (104).
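The late T behavior can be probed outside the normal model. As a hedged illustration, the sketch below uses a one-parameter exponential family with density u e^{−u y} (an assumption of the example, as are the flat prior and the grid), for which the Fisher information is 1/u². It computes the variance of the density proportional to e^{−T D(u)} and checks that it scales like the inverse Fisher information divided by T.

```python
import math


def kl_exponential(u, u_star):
    """D_KL(u* || u) between exponential densities u e^{-u y} with rates u* and u."""
    return math.log(u_star / u) + u / u_star - 1.0


def posterior_variance(T, u_star, u_max=6.0, steps=3000):
    """Variance of the density proportional to exp(-T * D(u)) on a grid (flat prior)."""
    du = u_max / steps
    grid = [du * (k + 1) for k in range(steps)]
    weights = [math.exp(-T * kl_exponential(u, u_star)) for u in grid]
    z = sum(weights)
    mean = sum(u * w for u, w in zip(grid, weights)) / z
    return sum((u - mean) ** 2 * w for u, w in zip(grid, weights)) / z


u_star = 2.0
inverse_fisher = u_star ** 2  # Fisher information of this family is 1 / u^2

var_100 = posterior_variance(100.0, u_star)
var_400 = posterior_variance(400.0, u_star)
variance_ratio = var_100 / var_400
```

Quadrupling T reduces the variance by a factor close to four, and T times the variance approaches the inverse Fisher information, as expected when the posterior is well approximated by a Gaussian in τ = 1/T.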

A Partial Differential Equation for Bayesian Inference
In light of the previous discussion, we shall now show that one can derive a convection-diffusion equation describing the evolution of the posterior predictive distribution when it is updated according to Bayes' law. As we have come to recognize, the sense in which Bayesian inference describes a diffusion process is in moving backwards relative to the observation of new information. We will therefore work in the time parameter τ, defined as the inverse to the time parameter T that tracks the amount of data observed. Given the time τ posterior distribution, which we have argued is of the form of a modified heat kernel, we can define the time τ posterior predictive model for future data p_τ by marginalizing over the likelihood model:
$$p_\tau(y) = \int_{S_U} \pi_\tau(u)\, p_0(y - G(u))\, \mathrm{Vol}_U(u). \tag{128}$$
In Section 3, we reviewed the relationship between the solution to a convection-diffusion equation and the convolution of given boundary value data with a heat kernel. Comparing (128) with (86), it is natural to regard the posterior distribution as a Markovian transition kernel measuring the probability of going from y ∈ S_Y to y − G(u) ∈ S_Y. From this perspective, we define the set of operators {P_τ}_{τ≥0} such that: Following the approach outlined in (87), we can now deduce the diffusion equation generated by the semi-group {P_τ}_{τ≥0}. By Taylor Expansion, we can compute: Here, we regard u̇^i τ as corresponding to the infinitesimal flow of the parameter u^i in terms of the vector field γ̇_τ generating the flow of the MAP. That is, u̇^i τ = δu^i. Following this interpretation: We can now use (104) to compute the expectation values explicitly to the order specified in the expansion. Because we are describing the diffusion in terms of the variable τ = 1/T and we shall eventually be taking the limit as τ → 0, we can use the T → ∞ results described in Equations (104) and (127). In particular, we make use of the fact that Meaning, we can write: We can rewrite these terms in a slightly more illuminating way by writing: Moreover, if we assume the MAP follows a gradient flow with respect to the log-likelihood, we can also write: Putting everything together, we have therefore shown that: Thus, taking the limit τ → 0 we obtain the result: This is precisely of the form of the Kolmogorov equation for a diffusion process with potential V = Φ(γ_τ; y) = Φ_τ! It is tempting to interpret the matrix K_{ij} as being associated with the induced Fisher metric in the space of probability distributions along the path of the MAP. From this perspective, we can regard Y_τ, the random variable associated with the T-dependent posterior predictive distribution, as a stochastic process on a curved space specified by the Stochastic Differential Equation: The potential is given by Φ_τ, which is minus the log-likelihood associated with the time τ maximum likelihood estimate for the generating distribution. Thus, we have shown that a Dynamical Bayesian inference induces a gradient flow with respect to the log-likelihood of the data-generating distribution.

The ERG Flow/Dynamical Bayesian Inference Correspondence
We are now prepared to build the dictionary relating ERG flow and Bayesian Inference. For simplicity, we shall consider one-parameter families of probability distributions on finite dimensional sample spaces; however, it is a simple exercise to generalize these insights to the infinite dimensional case as well. To begin, let us review the work we have presented in the previous sections.
In Section 2, we recalled the Wegner-Morris formulation for ERG and demonstrated that it is equivalent to a one-parameter family of probability distributions described by a gradient flow with respect to the relative entropy: In this equation, we interpret the time parameter t as a logarithm of the scale -t = ln Λ. The distribution q t = qt /Z q , with qt = e^{−V} for the potential V, can be regarded as specifying the ERG scheme through the fixing of a stationary point along the flow generated by (139).
The choice of V defines the functional, Σ = − ln pt qt , and the reparameterization kernel, Ψ = grad_g Σ; hence, it is equivalent to a choice of scheme in the standard Wegner-Morris sense. As we have reviewed in Section 3, the Fokker-Planck equation is equivalent to the Kolmogorov Forward equation for the Stochastic Process governed by the Stochastic Differential Equation: This is intuitively satisfying since such an equation describes a stochastic gradient descent of the potential function V. Once initial data are supplied in the form of a UV theory, p_0, (139) completely describes an ERG flow terminating at an IR fixed point.
In Section 4, we introduced the notion of Dynamical Bayesian Inference. Dynamical Bayesian Inference describes a one-parameter family of probability distributions obtained by implementing Bayes' law using data collected from a continuous time stochastic process. To quantify this family of distributions, we introduced the "time" parameter, T, which corresponds to the number of data incorporated into the model. In the direction of increasing T, the inferred probability model converges onto the distribution generating the observed data. In this respect, a Dynamical Bayesian flow has the complexion of an inversion in the ERG sense because it begins with an uninformed prior distribution but eventually converges to an informed distribution. In the language of ERG, this describes a flow from an IR theory to a UV theory. The pair consisting of an uninformed prior, along with the specification of a sufficiently complete set of data, therefore defines a flow terminating with a UV theory. We take this as our definition of a Dynamical Bayesian Flow.
This picture suggests that Dynamical Bayesian Inference and ERG flow are inverses of each other. If we can find a Dynamical Bayesian inference that begins with a prior distribution equal to the IR fixed point of an ERG flow, and which terminates at a data-generating distribution equal to the UV initial data of said ERG flow, we will have obtained an inversion of the ERG flow. Let us now describe a strategy for determining pairs of ERG flows with Dynamical Bayesian Flows.

Dynamical Bayesian Flow → ERG Flow
Suppose we are given a Dynamical Bayesian flow and asked to determine an ERG flow that is inverse to it. Recall, we have defined a Dynamical Bayesian Flow as the pair, (q_0, {Y_t}_{t=1}^T), where q_0 is an uninformed prior distribution and {Y_t}_{t=1}^T is a set of data generated by a distribution p*. Using these data, it is straightforward to define the corresponding ERG flow as the one-parameter family of probability distributions obtained from the Dynamical Bayesian flow when viewed as evolving backwards with respect to the time parameter T. In other words, one takes the terminal distribution of the Dynamical Bayesian flow, p*, as defining the UV initial data of the ERG flow and obtains subsequent probability distributions along the ERG flow by removing items of data from the inferred model.
In Section 4.6, we derived a partial differential equation governing just such a situation, in which the posterior predictive distribution evolves backwards against the collection of new data. The resulting partial differential equation describes a diffusion process that we dubbed Bayesian Diffusion: that is, one obtains a one-parameter family of probability distributions {p_τ} such that p_0 = p* and for which: This differential equation describes the evolution of a probability model governing the stochastic process Y_τ, which itself is governed by the Stochastic Differential Equation Here, γ_τ is a trajectory in the model space of the theory describing the path of the maximum a posteriori parameter estimate generated by the sequence of observed data, K is the pullback of the Fisher Information Metric on the space of models defined in (134), and Φ is the log-likelihood function associated with the Bayesian Inference scheme. Together, the aforementioned items constitute a choice of scheme for the Dynamical Bayesian flow. Bayesian diffusion can therefore be associated with an ERG flow governed by the Wegner-Morris equation: where now the stationary distribution, scheme function, and reparameterization kernel are, respectively, given by: These data, along with the distribution p*, define an ERG flow in the Wegner-Morris sense, which, by construction, is the inverse of the Dynamical Bayesian flow we began with.

Dynamical Bayesian Flow ← ERG Flow
Conversely, suppose that we are given an ERG flow in a Wegner-Morris form and asked to determine a Dynamical Bayesian flow that is inverse to it. To do so, we first identify the Bayesian Diffusion process that the ERG flow corresponds to. Since we have shown that ERG and Bayesian Diffusion are governed by the same equations, this is as simple as translating between the ERG scheme and the Dynamical Bayesian scheme. Comparing (139) and (143), we deduce that the ERG associated with an optimal transport with potential V and sample space metric g is equivalent to a Bayesian diffusion in which V is taken as the log-likelihood function, and g is identified with the metric K. These data are sufficient to define a Bayesian inference problem. Notice, if we go all the way back to (13), we can finally provide a conceptual understanding of the seed action: the seed action sets the log-likelihood for the Bayesian Inference scheme related to the ERG flow it defines.

A Dictionary
We summarize the analysis of this final section in Table 3. It provides a dictionary translating between the Wegner-Morris equations relevant to finite sample space ERG, infinite dimensional sample space ERG, and Bayesian Diffusion.

Renormalizability and Scale
The dictionary in Table 3 provides a blueprint for interpreting ERG in the language of Statistical Inference. In this role, our work suggests new approaches to understanding and resolving many interesting problems inside and outside of field theory. As a demonstration, we shall use this final section to discuss the meaning of renormalizability in the context of an ERG flow related to the Bayesian Inference paradigm.
An important observation in Table 3 is the correspondence between the Fisher Metric in the inference context and Ċ_Λ(x, y) in the ERG context. In the exact renormalization of a free field theory, Ċ_Λ(x, y) is the regulated two-point function and therefore sets a running momentum scale for operators in the theory. The Fisher Metric, I, plays an analogous role in the Bayesian Inference scheme as a generalized two-point function encoding a notion of scale through the covariance between operators.
The interpretation of the Fisher Metric as defining an energy scale is made very clear when we consider the inverse Bayesian flow as a diffusion process. Let M = dens(S) denote the manifold of probability distributions over a sample space S. Then, a Bayesian diffusion, or equivalently an Exact Renormalization Group flow, can be described by a drift-diffusion process generated by an operator L : M → M, such that:
$$\frac{\partial p_t}{\partial t} = -L\, p_t. \tag{145}$$
Given initial data (for example, in an ERG we would provide a UV theory) p_0 ∈ M, we can write the solution to (145) symbolically as:
$$p_t = e^{-tL}\, p_0, \tag{146}$$
where e^{−tL} is the heat kernel of L. To give more concrete meaning to (146), let us assume that L is a positive definite operator that can be diagonalized as
$$L\, \psi_n = \lambda_n^2\, \psi_n.$$
Here, {ψ_n} is a countably infinite set forming a basis for M, and λ_n is non-negative but may be equal to zero. An arbitrary element p ∈ M can be expanded as a series: p = ∑_n p_n ψ_n, where p_n is the coordinate of p in the eigendirection ψ_n. The spectrum of L defines an emergent energy scale in the following sense: consider the action of (146) as given by:
$$p_t = \sum_n e^{-t\lambda_n^2}\, p_n\, \psi_n. \tag{148}$$
Equation (148) dictates that the projection of p_t onto each mode, ψ_n, is damped over time with a strength determined by the "energy" λ_n². At time t, the effective description exponentially suppresses modes that have large eigenvalues relative to the operator L and thus sequentially removes these modes in a generalized integration over effective "momentum shells". From Equation (137), we can see that the operator L is generically a convection-diffusion operator with a diffusion matrix given by the Fisher Information. Through further comparison to (6), we see that in ERG for physical theories the analogous role is played by the regulated two-point function. This leads to the important conclusion that, in physical contexts, the emergent energy scale is in fact equivalent to a physical one.
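The mode-by-mode damping can be illustrated with the simplest possible operator. The sketch below takes, as an assumption of the example, L to be minus the second derivative on a circle, with eigenfunctions cos(n x) and eigenvalues n² (so that λ_n = n). The initial coefficients in `p0` are hypothetical. It applies the heat kernel in the eigenbasis and checks by finite differences that the reassembled density satisfies the flow equation.

```python
import math


def evolve(coefficients, t):
    """Apply e^{-tL} in the eigenbasis: mode n is damped by exp(-t * n^2)."""
    return [c * math.exp(-t * n * n) for n, c in enumerate(coefficients)]


def synthesize(coefficients, x):
    """Reassemble p(x) = sum_n p_n * psi_n(x) with psi_n(x) = cos(n x)."""
    return sum(c * math.cos(n * x) for n, c in enumerate(coefficients))


# Hypothetical initial data p_0 on the circle, given by its mode coefficients p_n.
p0 = [1.0 / (2.0 * math.pi), 0.10, 0.05, 0.02]

t = 0.5
pt = evolve(p0, t)
damping = [b / a for a, b in zip(p0, pt)]  # equals exp(-t * n^2) mode by mode


def heat_residual(x, t, h=1e-3):
    """Finite-difference check of d/dt p_t = -L p_t with L = -d^2/dx^2."""
    d_t = (synthesize(evolve(p0, t + h), x)
           - synthesize(evolve(p0, t - h), x)) / (2.0 * h)
    d2_x = (synthesize(evolve(p0, t), x + h)
            - 2.0 * synthesize(evolve(p0, t), x)
            + synthesize(evolve(p0, t), x - h)) / h ** 2
    return d_t - d2_x


residual = heat_residual(0.7, t)
```

The zero mode, which carries the normalization, is untouched, while each higher mode is suppressed more strongly than the one below it. This is the discrete analogue of integrating out successive "momentum shells".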
More generally, interpreting the two-point function, or the covariance matrix, in a statistical inference problem as generating an emergent energy scale provides the foundation for an information-theoretic interpretation of the conditions for renormalizability. In Wilsonian RG, a theory is said to be non-renormalizable if the divergences present in higher-order Feynman diagrams can only be canceled by the introduction of an infinite number of arbitrarily high-energy couplings [46]. By contrast, a theory is said to be renormalizable if arbitrarily higher-order Feynman diagrams can be computed by introducing only a finite number of operator sourced counterterms.
In the language of statistical inference, the operator content of a theory is related to the problem of model selection [47,48]. In the context of parametric statistics (as we have mentioned in the introduction, the relationship between Wilsonian RG and ERG is analogous to the relationship between Parametric and Non-Parametric statistics), model selection can be reduced to determining the set of sufficient parameters needed to form a model that can accurately compute the expectation values of any observable associated with the system of interest. A natural framing of this problem is given in terms of n-point correlation functions for the random variable, Y, observed throughout the inference. It is sensible to restrict our attention to n-point functions since arbitrary observables can be constructed from them using a Taylor expansion. Put differently, a probability distribution can be reconstructed with knowledge of all its moments.
In view of the previous discussion, higher n-point functions can be interpreted as encoding information at higher values of the emergent energy scale. This inspires the interpretation that an inference model is "renormalizable" if a finite N exists such that for any n > N, the n-point function can be computed with the information contained only in m-point functions with m ≤ N. In other words, an energy, λ_*², as measured through L, exists, above which all of the information in the theory is actually encoded in lower energy operators. This happens, for example, in a Gaussian theory in which all n-point functions higher than N = 2 can be formulated as sums of products of 2-point functions using Wick's theorem. If no such finite N exists, the inference problem is "nonrenormalizable". A nonrenormalizable theory can therefore be understood as a theory in which an infinite number of n-point functions will be required to compute the expectation values of arbitrary observables. In other words, there is no energy scale above which information becomes encoded in the energy scales below it: every energy scale contributes, in some sense, independently to the theory.
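The Gaussian example can be verified directly. The sketch below takes a hypothetical 2×2 covariance matrix `C` as the two-point function of a zero-mean bivariate Gaussian, computes several four-point functions by brute-force quadrature on a grid, and compares them with the sum over the three Wick pairings. The grid extent and spacing are assumptions chosen so that the quadrature error is small.

```python
import math

# Hypothetical 2x2 covariance matrix C (the two-point function).
C = [[1.0, 0.6], [0.6, 2.0]]
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
C_inv = [[C[1][1] / det, -C[0][1] / det], [-C[1][0] / det, C[0][0] / det]]


def density(x):
    """Zero-mean bivariate Gaussian density with covariance C."""
    quad = sum(x[i] * C_inv[i][j] * x[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


# Tabulate the density once on a uniform grid covering [-8, 8]^2.
h = 0.1
axis = [-8.0 + h * k for k in range(161)]
samples = [((a, b), density((a, b)) * h * h) for a in axis for b in axis]


def four_point_numeric(i, j, k, l):
    """<x_i x_j x_k x_l> by direct quadrature."""
    return sum(x[i] * x[j] * x[k] * x[l] * w for x, w in samples)


def four_point_wick(i, j, k, l):
    """Wick's theorem: sum over the three pairings of two-point functions."""
    return C[i][j] * C[k][l] + C[i][k] * C[j][l] + C[i][l] * C[j][k]


index_sets = [(0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1), (1, 1, 1, 1)]
gaps = [abs(four_point_numeric(*s) - four_point_wick(*s)) for s in index_sets]
```

In this example N = 2 suffices: the four-point data carry no information beyond what is already contained in `C`, which is the sense of "renormalizable" proposed above.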

Discussion
In this note, we have demonstrated that an ERG flow can be identified with a diffusion process that is inversely related to a Dynamical Bayesian Inference scheme. In particular, we have argued that ERG flow can be understood as a one-parameter family of probability distributions arising where data are continuously removed from the inferred probability model. We have motivated this interpretation by illustrating that the equations governing ERG and Bayesian Diffusion can be brought into direct correspondence with one another, as outlined in Section 5. The resulting dictionary provides a novel, fully information-theoretic language for understanding ERG flow. It also provides an operational answer to the question of what it means to "invert" an ERG flow.
From a very general perspective, the solution to this problem can be framed in the following way. Given a preliminary probability distribution, p_0, we imagine running our model through a noisy channel generated by a diffusion operator B. In other words, we produce a probability distribution, p_τ, which solves the differential equation:
$$\frac{\partial p_\tau}{\partial \tau} = -B\, p_\tau, \tag{149}$$
with initial data p_0. After a given period of time, t, we obtain a new probability distribution, p_t = e^{−tB}(p_0), which has lost some of the information previously contained in p_0 to diffusion. In the case of an ERG flow viewed from the functional diffusion perspective, we can regard this loss of information as being generated by a coarse-graining scheme encoded in the operator B.
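A rough quantitative picture of the information lost to diffusion can be obtained in the Gaussian case. The sketch below assumes that B generates the ordinary heat flow with diffusivity `sigma`, so that a normal p_0 with variance `var0` is carried to a normal p_t with variance var0 + σ²t, and it uses the relative entropy of p_0 with respect to p_t as one possible measure of the loss. Both the choice of operator and the direction of the relative entropy are assumptions of the example.

```python
import math


def kl_same_mean(var_p, var_q):
    """D_KL(p || q) for two normal densities with a common mean."""
    return 0.5 * (math.log(var_q / var_p) + var_p / var_q - 1.0)


def diffused_variance(var0, sigma, t):
    """Variance of p_t when p_0 = N(m, var0) evolves under the heat flow."""
    return var0 + sigma ** 2 * t


var0, sigma = 0.25, 1.0
times = [0.0, 0.5, 1.0, 2.0, 4.0]
information_lost = [
    kl_same_mean(var0, diffused_variance(var0, sigma, t)) for t in times
]
```

The divergence vanishes at t = 0 and grows monotonically with t, so the longer the model is run through the channel, the more has to be re-learned in order to recover p_0.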
We then ask the question, can this diffusion process be "inverted"? Since exact inversion may not be possible, we frame this problem in the form of an optimization scheme. Consider the set ℱ consisting of all operators F, generating a one-parameter flow of probability distributions, q_T, such that:
$$\frac{\partial q_T}{\partial T} = -F\, q_T,$$
subject to an initial condition q_0. If we take the initial data of this process to be the terminal distribution of the diffusion process given by (149), q_0 = e^{−tB}(p_0), we can interpret the solution q_t = e^{−tF}(q_0) as a reconstruction algorithm for the initial data, p_0. It is natural to identify the optimal reconstruction algorithm as the operator F* ∈ ℱ, for which the relative entropy between the reconstructed distribution, q_t, and the initial data, p_0, is minimal: In this language, we interpret the main result of our paper as dictating that, given an ERG generated by a diffusion operator B, the optimal reconstruction operator F* corresponds to a continuous Bayesian Inference scheme in which the information lost to coarse-graining is re-learned and hence reincorporated into the model. This clarifies the sense in which an ERG is "invertible" as long as we allow for the reconstruction of information ostensibly destroyed by diffusion.
One can visualize this process as follows: imagine an experimenter performing a statistical inference experiment in which they observe a collection of data {Y_i}_{i=1}^T, generated from the distribution p_0. Next, imagine we can place each of the observations along the real axis, distinguishing a series of points, each of which we label by a probability distribution {p_T}_{T=0}^∞. The probability distribution at the Tth point, p_T, is obtained by incorporating all of the data to the left of T into a model using Bayes' law. Moving to the left along this axis corresponds to disincorporating data from the model and therefore induces a diffusion process and by extension an ERG scheme. Conversely, moving to the right along this axis corresponds to reincorporating lost data and therefore inverts the ERG flow.
Framing the relationship between ERG and Statistical Inference in terms of the reconstruction problem (151) suggests several interesting paths for future study. Firstly, the reconstruction problem is equivalent to a common problem encountered in Machine Learning when one wishes to sample data from an analytically intractable distribution, p_0. An approach to this problem goes by the name of Diffusion Learning [49-52]. Diffusion learning is a two-step process: first, one uses a diffusion operator, B, with a known fixed point to transform the initial data p_0 into an analytically tractable form. Then one identifies a second diffusion operator, F, which optimally reconstructs the initial data without sacrificing analytic tractability. This routine is equivalent to (151), provided we restrict the set of allowed reconstruction algorithms to operators that generate diffusion processes. More generally, the information-theoretic formulation of ERG constructed in this paper renders renormalization in a form that is amenable to applications outside of pure physics. We hope this will catalyze continued work, especially at the intersection of physics and data science, geared towards constructing and better understanding machine learning algorithms like diffusion learning. Since the original drafting of this note, some work to this end has been undertaken. In [53], the approach introduced in this paper was adapted into a practical renormalization scheme for generic statistical inference models including neural networks. As a proof of concept, this scheme was subsequently applied to construct renormalization group flows for autoencoder neural networks.
A second fascinating implementation of (151) appears in the study of Holography [54]. There, one is interested in reconstructing a bulk spacetime from the data contained in a quantum field theory on its conformal boundary [55,56]. The relationship between our

Table 3. Dictionary relating ERG and Bayesian Diffusion.
the expectation value of the Stochastic Differential Equation (54). The heat equation is a special case of the continuity equation with the flow vector given by grad(ln(p_t)): to see that this is the case, let us write µ_t = p_t Vol. Then, it is straightforward to show that d(i_{grad p_t} Vol_S) = div(grad p_t) Vol_S = ∆p_t Vol_S = ∆µ_t. (78)